- >I could easily write a robot which would roam around the Web (perhaps
- >stochastically?), and verify the html, using sgmls. Then, whenever
- >I come across something that's non-compliant, I could automatically
- >send mail to wwwmaster@sitename. No one would have to annoy anyone else
- >about whether or not they've verified their HTML; a program would annoy
- >them automatically.
-
- I have written a robot that does this, except it doesn't check for
- valid SGML -- it just tries to map out the entire web. I believe I
- found roughly 50 or 60 different sites (this was maybe 2 months ago --
- I'm sorry, I didn't save the output). It took the robot about half a
- day (a Saturday morning) to complete.
-
- There were several problems.
-
- First, some sites were down and my robot would spend a considerable
- time waiting for the connection to time out each time it found a link
- to such a site. I ended up remembering the last error from a site and
- skipping sites that were obviously down, but there are many different
- errors you can get, depending on whether the host is down,
- unreachable, doesn't run a WWW server, doesn't recognize the document
- address you want, or has some other trouble (some sites were going up
- and down while my robot was running, causing additional confusion).
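- The skip-dead-sites idea can be sketched roughly like this in
- modern Python (the names and the caching policy here are my own
- illustration, not the robot's actual code):

```python
import urllib.parse
import urllib.request

# hostname -> reason for the most recent failure (hypothetical cache)
dead_hosts = {}

def fetch(url, timeout=10):
    """Fetch a URL, skipping hosts that have already failed once."""
    host = urllib.parse.urlsplit(url).hostname
    if host in dead_hosts:
        return None  # this site looked down last time; don't wait again
    try:
        with urllib.request.urlopen(url, timeout=timeout) as f:
            return f.read()
    except OSError as err:  # covers timeouts, refused connections, DNS errors
        dead_hosts[host] = str(err)  # remember why this host failed
        return None
```

- The real robot had to distinguish many more error cases (host down,
- unreachable, no server, bad document address); the single cache above
- just shows the "remember and skip" shape of the fix.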
-
- Next, more importantly, some sites have an infinite number of
- documents. There are several causes for this.
-
- First, several sites have gateways to the entire VMS documentation (I
- have never used VMS but apparently the VMS help system is a kind of
- hypertext). While not exactly infinite the number of nodes is *very*
- large. Luckily such gateways are easily recognized by the kind of
- pathname they use, and VMS help is unlikely to contain pointers to
- anything except more VMS help, so I put in a simple trap to stop
- these.
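- A trap like that can be as simple as a path-prefix check; the
- prefixes below are invented for illustration (the real gateways had
- their own characteristic paths):

```python
# Made-up example prefixes; the actual VMS help gateways were
# recognizable by their own characteristic path names.
VMS_GATEWAY_PREFIXES = ("/htbin/vmshelp", "/cgi-bin/vmshelpgate")

def is_vms_help(path):
    """Return True for paths that look like a VMS help gateway."""
    return path.startswith(VMS_GATEWAY_PREFIXES)
```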
-
- Next, there are other gateways. I can't remember whether I
- encountered a Gopher or WAIS gateway, but these would have even worse
- problems.
-
- Finally, some servers contain bugs that cause loops, by referring to
- the same document with an ever-growing path. (The rules for
- resolving relative paths are tricky, and since I was using my own
- www client, which isn't derived from Tim's, the problem was more
- severe for me; but I have also found occurrences reproducible with
- the CERN www client.)
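- The failure mode is that a page at /a/b.html containing the relative
- reference "b.html" must resolve to /a/b.html again, not /a/a/b.html.
- Today urllib.parse.urljoin implements the correct resolution rules;
- a crude guard against the ever-growing-path bug is to cap the path
- depth (the limit of 20 below is an arbitrary assumption of mine):

```python
from urllib.parse import urljoin, urlsplit

def resolve(base, ref, max_depth=20):
    """Resolve a relative reference and guard against runaway paths."""
    url = urljoin(base, ref)  # applies the relative-resolution rules
    if urlsplit(url).path.count("/") > max_depth:
        raise ValueError("suspiciously deep path: " + url)
    return url
```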
-
- Although I didn't specifically test for bad HTML, I did have to parse
- the HTML to find the links, and found occasional errors. I believe
- there are a few binaries and PostScript and WP files with links
- pointing to them, which take forever to fetch. There were also
- broken addresses scattered here and there -- this was a good
- occasion for me to debug my www client library.
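- The link-extraction step can be sketched with today's html.parser,
- which tolerates the kind of sloppy markup described above (this is
- my own illustration, not the robot's parser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```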
-
- If people are interested, I could run the robot again and report a
- summary of the results.
-
- I also ran a gopher robot, but after 1600 sites I gave up... The
- Veronica project in the Gopher world does the same and makes the
- results available as a database, although the last time I tried it the
- Veronica server seemed too overloaded to respond to a simple query.
-
- If you want source for the robots, they're part of the Python source
- distribution: ftp to ftp.cwi.nl, directory pub/python, file
- python0.9.8.tar.Z. The robot (and in fact my entire www and gopher
- client library) is in the tar archive in directory python/demo/www.
- The texinfo to html conversion program that I once advertised here is
- also there. (I'm sorry, you'll have to build the python interpreter
- from the source before any of these programs can be used...)
- Note that my www library isn't up to date with the latest HTML specs;
- this is a hobby project and I needed my time for other things...
-
- --Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>